The high-throughput poly(A) RNA-seq data used in this notebook are described in Neeves et al, Brain (2022). They are derived from nuclear and cytoplasmic fractions of human induced pluripotent stem cells (hiPSC; day 0), neural precursors (NPC; day 3 and day 7), āpatternedā precursor motor neurons (ventral spinal cord; pMN; day 14), post-mitotic but electrophysiologically immature motor neurons (MN; day 22), and electrophysiologically active MNs (mMNs; day 35). The data required for this practical session can be downloaded from Zenodo.
The gene expression count data was obtained from Kallisto (Bray et
al., 2016) using the Gencode hg38 release Homo sapiens
transcriptome.
#The gene expression count data, as mentioned in the R Markdown document, typically consists of a matrix where rows represent genes and columns represent samples (cells at timepoints). Each entry in the matrix corresponds to the count of RNA-seq reads mapped to a particular gene in a specific sample. This count data is obtained from the sequencing process and serves as the basis for downstream analysis, including differential gene expression analysis, clustering, and visualization
load("C:/Users/norar/OneDrive/Documentos/EPFL/Genomics and bioinformatics/week7/10944047/data_09_04_2024.RData")
#Data:
# myE_ge : raw gene expression count matrix
# info : sample annotation (data-frame)
# ttg : rows (genes) annotation
# Focus on CTRL samples for this session
sel_samples <- which(info$mutant=="CTRL")
myE_ge <- myE_ge[,sel_samples]
info <- info[sel_samples,]
info$group <- factor(paste(info$Fraction,info$DIV,sep="_"),levels=unique(paste(info$Fraction,info$DIV,sep="_")))
#Make some nice colors to facilitate the visualisation of time-points
mytime <- factor(as.character(info$DIV),levels=c(0,3,7,14,22,35))
mycols_days <- c("#CCFF00","#33CC33","#669999","#6699FF","#3300FF","#CC33CC")
names(mycols_days) <- c(0,3,7,14,22,35)
mycols <- unlist(lapply(info$DIV,function(Z)return(mycols_days[match(as.character(Z),names(mycols_days))])))How many samples is there in your count data? 48 samples
Does this correspond to your experimental protocol?
Check that the colnames of the data-count corresponds to your sampleID in the sample annotation file. They do.
What are the covariates? Time, disease vs no disease, fractionation (nuclear or cytoplasmic).
how many rows? 39,235 entries?
Check that the rowID of the data-count corresponds to the ensembl gene ID in your gene annotation file.
## [1] 48
#Check that the colnames of the data-count corresponds to your sampleID in the sample annotation file:
match(colnames(myE_ge),info$sampleID)## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
#Check that the colnames of the data-count corresponds to your sampleID in the sample annotation file:
sum(is.na(match(rownames(myE_ge),ttg$ens_gene)))## [1] 0
How does your data look-like?
par(mfrow=c(2,3))
plot(density(myE_ge[,1]),main="Read count distribution in sample 1")
boxplot(myE_ge,las=1,ylab="raw gene count",main="with outliers")
boxplot(myE_ge,outline=FALSE,las=1,ylab="raw gene count",main="without outliers")
plot(density(t(myE_ge[1,])),main="Read count distribution across samples")
boxplot(t(myE_ge[c(1,10,45,100,200,3000,4000),]),las=1,ylab="raw gene count",main="with outliers")
boxplot(t(myE_ge[c(1,10,45,100,200,3000,4000),]),outline=FALSE,las=1,ylab="raw gene count",main="without outliers")
Analysis of the variance of the gene count across the \(N\) samples: \(\sigma^2=Var(X)=\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2\) and \(\mu=\frac{1}{n}\sum_{i=1}^n\)x_i:
Letās have a close look at how the variance in gene expression scale with average and then correct it with some transformation.
#calculate mean and variance of the rows
row_avg<-apply(myE_ge,1,mean)
row_var <-apply(myE_ge,1,var)
#Log-transform your count data
myE_gel <- log2(1+myE_ge)
#DESeq2 variance stabilisation
vsd <- DESeq2::varianceStabilizingTransformation(matrix(as.integer(myE_ge),nrow=nrow(myE_ge),ncol=ncol(myE_ge),byrow = FALSE))
par(mfrow=c(1,3))
#Variance scale with average --> Poisson distribution
plot(row_avg,row_var,las=1,main="RAW data",pch=19,col=rgb(0,0,0,0.2),cex=0.5,xlab="mean read count",ylab="variance read count",xlim=c(0,5000),ylim=c(0,10^7))
grid()
plot(apply(myE_gel,1,mean),apply(myE_gel,1,var),las=1,main="Log2-transformed data",pch=19,col=rgb(0,0,0,0.1),cex=0.5,xlab="mean read count",ylab="variance read count")
grid()
plot(apply(vsd,1,mean),apply(vsd,1,var),las=1,main="DESeq2 variance stabilised",pch=19,col=rgb(0,0,0,0.05),cex=0.5,xlab="mean read count",ylab="variance read count")
grid()
Letās first have a look at the distribution of gene expression for a few samples:
par(mfrow=c(1,3))
geneplotter::multidensity(myE_gel[,c(1,2,4,10)],main="Read count distributions",las=1,xlab="read count [log2]")
plot(density(myE_gel[,1]),las=1,main=colnames(myE_gel)[1],las=1,xlab="read count [log2]")
plot(density(myE_gel[,3]),las=1,main=colnames(myE_gel)[3],las=1,xlab="read count [log2]")
We can identify reliably expressed genes by fitting a bimodal distribution to the log2-read count distribution of each samples to discriminate between the foreground and the background transcription. The limit must be fitted to each individual sample given that the library size will impact this factor.
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%
## fitting ...
## | | | 0% | |================================================= | 33% | |================================================================================================== | 67% | |===================================================================================================================================================| 100%

We can now select for the reliably expressed genes in each sample:

Letās first look at the ditribution of the gene count across a few
samples:

There are several options for normalisation. Scaling factors,
quantile normalisation etc.

What is the effect on the gene count?

Hierarchical clustering of the samples is frequently used to analyse whether similar samples cluster together.
Letās first compare the clustering of the samples using the Manhattan
distance and Ward D algorithm:

Samples are very similarly clustered in Quantile and Deseq2 normalised methods. Letās compare the different effect of the distances used as well as agglomerative alorithm. From now we will use the quantile normalised matrix:

There are many useful R packages to perform differential gene expression analysis of bulk RNA-sequencing data usge as Sleuth, edgeR and DESeq2.
#Volcano plot
VolcanoPlot <- function(myDGE = res, col_log2FC = res$log2FoldChange, col_pval = res$padj, title = "") {
mycol_genes <- rep(rgb(0.7, 0.7, 0.7, 0.2))
mycol_genes[abs(col_log2FC) >= 1.0 & col_pval <= 0.05] <- rep(rgb(0.0, 0.0, 0.0, 0.2))
plot(col_log2FC, -log10(col_pval), pch = 19, cex = 0.3, col = mycol_genes, xlab = "log2FC", ylab = "-log10(P-value)", frame = FALSE, main = title)
abline(h = -log10(0.01), lty = 2, col = "grey")
abline(v = c(-1, 1), lty = 2, col = "grey")
}
#Biological Pathway Gene Enrichment Analysis
GO_analysis <- function(genes_list){
gostres_diff <- gost(query = genes_list,
organism = "hsapiens", ordered_query = FALSE,
multi_query = FALSE, significant = TRUE, exclude_iea = FALSE,
measure_underrepresentation = FALSE, evcodes = TRUE,
user_threshold = 0.05, correction_method = "g_SCS",
domain_scope = "annotated", custom_bg = NULL,
numeric_ns = "", as_short_link = FALSE,sources=c("GO:BP", "GO:MF","KEGG"))
gostplot(gostres_diff, capped = FALSE, interactive = TRUE)#please note this is going to create an interactive plot
}
#Create a venn diagram from two gene lists
## This is an example - not meant to run
vennDiag <- function(genes_lists){
genes_comparisons <- do.call(what=cbind,args=lapply(genes_lists,function(Z)return(rownames(myE_gen)%in%Z)))
colnames(genes_comparisons)<-c("cond1","cond2")
vennDiagram(genes_comparisons,main="blasdfas")
}Here is a tutorial how to run DESeq2 on this data:
mycountData <- matrix(as.integer(myE_ge),nrow=nrow(myE_ge),ncol=ncol(myE_ge),byrow = FALSE)
colnames(mycountData) <- colnames(myE_ge)
rownames(mycountData) <- rownames(myE_ge)
info$DIV <- factor(info$DIV,levels=c("0","3","7","14","22","35")) #These are the 6 time points
#How to fit DEseq2:
dds <- DESeq2::DESeqDataSetFromMatrix(countData = mycountData[is_expressed_global,],
colData = info,
design= ~ DIV + Fraction)
dds <- DESeq2::DESeq(dds)
mycoeff <- resultsNames(dds) # lists the coefficients
#How to extract a result (for a specific comparison: if comparison corresponds to mycoeff index = "DIV_14_vs_0":
res <- results(dds, name="DIV_14_vs_0")
genes_up <- as.character(rownames(res))[res$log2FoldChange>1.0&res$padj<0.01]
genes_do <- as.character(rownames(res))[res$log2FoldChange<(-1.0)&res$padj<0.01]Here is a tutorial how to run EdgeR with Quasi-likelihood F-tests on this data:
mycountData <- matrix(as.integer(myE_ge),nrow=nrow(myE_ge),ncol=ncol(myE_ge),byrow = FALSE)
colnames(mycountData) <- colnames(myE_ge)
rownames(mycountData) <- rownames(myE_ge)
info$DIV <- factor(info$DIV,levels=c("0","3","7","14","22","35"))
y <- DGEList(counts=mycountData[is_expressed_global,],group=info$group)
design <- model.matrix(~info$group)
y <- estimateDisp(y,design)
#To perform quasi-likelihood F-tests:
fit <- glmQLFit(y,design)
mycoefs <- colnames(fit$design)
#The fit has 12 parameters. The first is the baseline level of group 1 (Nuclear_0). The remaining are the difference between Nuclear_0 and the other groups.
#To compare Nuclear_3 vs Nuclear_0:
qlf.nuc3 <- glmQLFTest(fit, coef=2)
res<- topTags(qlf.nuc3,n=sum(is_expressed_global))
#To compare Nuclear_7 vs Nuclear_3
qlf.nuc3.nuc7 <- glmQLFTest(fit, contrast=c(0,-1,1,rep(0,9)))
res<- topTags(qlf.nuc3.nuc7,n=sum(is_expressed_global))
res <- topTags(glmQLFTest(fit, coef=3),n=sum(is_expressed_global))$table
genes_up <- as.character(rownames(res))[res$logFC>1.0&res$FDR<0.01]
genes_do <- as.character(rownames(res))[res$logFC<(-1.0)&res$FDR<0.01]Here is a tutorial how to run EdgeR with Likelihood ratio tests on this data:
mycountData <- matrix(as.integer(myE_ge),nrow=nrow(myE_ge),ncol=ncol(myE_ge),byrow = FALSE)
colnames(mycountData) <- colnames(myE_ge)
rownames(mycountData) <- rownames(myE_ge)
info$DIV <- factor(info$DIV,levels=c("0","3","7","14","22","35"))
y <- DGEList(counts=mycountData[is_expressed_global,],group=info$group)
design <- model.matrix(~info$group)
y <- estimateDisp(y,design)
#To perform likelihood ratio tests:
fit <- glmFit(y,design)
mycoefs <- colnames(fit$design)
#The fit has 12 parameters. The first is the baseline level of group 1 (Nuclear_0). The remaining are the difference between Nuclear_0 and the other groups.
#To compare Nuclear_3 vs Nuclear_0:
lrt.nuc3 <- glmLRT(fit,coef=2)
res<- topTags(lrt.nuc3,n=sum(is_expressed_global))$table
genes_up <- as.character(rownames(res))[res$logFC>1.0&res$FDR<0.01]
genes_do <- as.character(rownames(res))[res$logFC<(-1.0)&res$FDR<0.01]Your homework will be graded according to three criteria: 1) Correctness of your result; 2) Clarity of the visual output; 3) Description of your results demonstrating your ability to discuss your result in their biological context.
In this task, you will perform a differential gene expression using Deseq2.
## [1] "Intercept" "DIV_3_vs_0" "DIV_7_vs_0" "DIV_14_vs_0"
## [5] "DIV_22_vs_0" "DIV_35_vs_0" "Fraction_Nuclear_vs_Cytoplasmic"
res_1 <- results(dds, name = "DIV_3_vs_0")
res_2 <- results(dds, name = "DIV_7_vs_0")
res_3 <- results(dds, name = "DIV_14_vs_0")
res_4 <- results(dds, name = "DIV_22_vs_0")
res_5 <- results(dds, name = "DIV_35_vs_0")
# Extracting upregulated and downregulated genes for each comparison
genes_up_1 <- as.character(rownames(res_1))[res_1$log2FoldChange > 1.0 & res_1$padj < 0.01]
genes_do_1 <- as.character(rownames(res_1))[res_1$log2FoldChange < -1.0 & res_1$padj < 0.01]
genes_up_2 <- as.character(rownames(res_2))[res_2$log2FoldChange > 1.0 & res_2$padj < 0.01]
genes_do_2 <- as.character(rownames(res_2))[res_2$log2FoldChange < -1.0 & res_2$padj < 0.01]
genes_up_3 <- as.character(rownames(res_3))[res_3$log2FoldChange > 1.0 & res_3$padj < 0.01]
genes_do_3 <- as.character(rownames(res_3))[res_3$log2FoldChange < -1.0 & res_3$padj < 0.01]
genes_up_4 <- as.character(rownames(res_4))[res_4$log2FoldChange > 1.0 & res_4$padj < 0.01]
genes_do_4 <- as.character(rownames(res_4))[res_4$log2FoldChange < -1.0 & res_4$padj < 0.01]
genes_up_5 <- as.character(rownames(res_5))[res_5$log2FoldChange > 1.0 & res_5$padj < 0.01]
genes_do_5 <- as.character(rownames(res_5))[res_5$log2FoldChange < -1.0 & res_5$padj < 0.01]
# Calculate the lengths for up and down
up_counts <- c(length(genes_up_1), length(genes_up_2), length(genes_up_3), length(genes_up_4), length(genes_up_5))
do_counts <- c(length(genes_do_1), length(genes_do_2), length(genes_do_3), length(genes_do_4), length(genes_do_5))
mycols_days_subset <- mycols_days[-1]
# Create barplots for upregulated and downregulated genes with color grading
barplot(up_counts, names.arg = c("DIV_3", "DIV_7", "DIV_14", "DIV_22", "DIV_35"),
main = "Upregulated Genes", xlab = "Time Point", ylab = "Number of Genes", col = mycols_days_subset)barplot(do_counts, names.arg = c("DIV_3", "DIV_7", "DIV_14", "DIV_22", "DIV_35"),
main = "Downregulated Genes", xlab = "Time Point", ylab = "Number of Genes", col = mycols_days_subset)
As expected, we see that over time both the number of upregulated and
downregulated genes increases, since we expect that upon differentiation
more genes associated with the specialized cellās function will be
activated or not activated at all. At the first stages, the cell does
not have a function yet so there are no genes that will be more or less
expressed than others.
par(mfrow = c(2, 3))
VolcanoPlot(myDGE = res_1, col_log2FC = res_1$log2FoldChange, col_pval = res_1$padj, title = "Time Point 1")
VolcanoPlot(myDGE = res_2, col_log2FC = res_2$log2FoldChange, col_pval = res_2$padj, title = "Time Point 2")
VolcanoPlot(myDGE = res_3, col_log2FC = res_3$log2FoldChange, col_pval = res_3$padj, title = "Time Point 3")
VolcanoPlot(myDGE = res_4, col_log2FC = res_4$log2FoldChange, col_pval = res_4$padj, title = "Time Point 4")
VolcanoPlot(myDGE = res_5, col_log2FC = res_5$log2FoldChange, col_pval = res_5$padj, title = "Time Point 5")
Each day the plot explodes more, because there are more differentially
expressed genes over time. It plots the log2 fold change (effect size,
changes in gene expression) of gene expression on the x-axis against the
negative logarithm of the p-value (statistical significance) on the
y-axis.
We have to analyze the gene expression data to understand which biological processes or pathways are affected when certain genes are either upregulated or downregulated. GO_analysis applies a method for annotating genes with their biological processes, molecular functions, and cellular components (it takes a list of genes as input and retrieves GO terms associated with those genes).
As for the interpretation of the two plots we obtain, the dotsā colors and positions typically relate to the significance and enrichment of biological pathways associated with the analyzed gene set.The higher up they are in the plot, the more significant the enrichment.Therefore, the orange dots in the center represent the pathways that are most strongly associated with the genes in our dataset; while the red dots in the corners represent pathways with which the genes in our dataset do not show a significant association.
In our case, the biological pathways that our upregulated genes activate are mainly ones that carry out nervous system and general developmental functions; while the functions that are not being carried out (the most significant biological pathways for the downregulated genes, the functions that are being discarded with respect to the timepoint 0) are signaling and cell-cell communication functions. We could explain this situation by saying that at the early stages (we are at the first timepoint), the cells are still in the developmental stage; but that already from the first to the second timepoint there is a bit of differentiation, which means that some less useless functions to communicate with other cells or to ālistenā to the environment are being discarded. We could say that the cells are slowly differentiating into neurons while becoming ādeafā to external stimuli.
Do the same as task 1 but using EdgeR ā Likelihood ratio tests. Comment on the differences between the techniques.
#1) Compute the number of differentially expressed genes using edgeR:
# Define a function to perform likelihood ratio test for a given coefficient index
lrt.nuc_1 <- glmLRT(fit,coef=2)
lrt.nuc_2 <- glmLRT(fit,coef=3)
lrt.nuc_3 <- glmLRT(fit,coef=4)
lrt.nuc_4 <- glmLRT(fit,coef=5)
lrt.nuc_5 <- glmLRT(fit,coef=6)
res_1 <- topTags(lrt.nuc_1,n=sum(is_expressed_global))$table
res_2 <- topTags(lrt.nuc_2,n=sum(is_expressed_global))$table
res_3 <- topTags(lrt.nuc_3,n=sum(is_expressed_global))$table
res_4 <- topTags(lrt.nuc_4,n=sum(is_expressed_global))$table
res_5 <- topTags(lrt.nuc_5,n=sum(is_expressed_global))$table
genes_up_3_edgeR <- as.character(rownames(res_1))[res_1$logFC > 1.0 & res_1$FDR < 0.01]
genes_do_3_edgeR <- as.character(rownames(res_1))[res_1$logFC < -1.0 & res_1$FDR < 0.01]
genes_up_7_edgeR <- as.character(rownames(res_2))[res_2$logFC > 1.0 & res_2$FDR < 0.01]
genes_do_7_edgeR <- as.character(rownames(res_2))[res_2$logFC < -1.0 & res_2$FDR < 0.01]
genes_up_14_edgeR <- as.character(rownames(res_3))[res_3$logFC > 1.0 & res_3$FDR < 0.01]
genes_do_14_edgeR <- as.character(rownames(res_3))[res_3$logFC < -1.0 & res_3$FDR < 0.01]
genes_up_22_edgeR <- as.character(rownames(res_4))[res_4$logFC > 1.0 & res_4$FDR < 0.01]
genes_do_22_edgeR <- as.character(rownames(res_4))[res_4$logFC < -1.0 & res_4$FDR < 0.01]
genes_up_35_edgeR <- as.character(rownames(res_5))[res_5$logFC > 1.0 & res_5$FDR < 0.01]
genes_do_35_edgeR <- as.character(rownames(res_5))[res_5$logFC < -1.0 & res_5$FDR < 0.01]
# Calculate the lengths for up and downregulated genes for each time point
up_counts_edgeR <- c(length(genes_up_3_edgeR), length(genes_up_7_edgeR), length(genes_up_14_edgeR), length(genes_up_22_edgeR), length(genes_up_35_edgeR))
do_counts_edgeR <- c(length(genes_do_3_edgeR), length(genes_do_7_edgeR), length(genes_do_14_edgeR), length(genes_do_22_edgeR), length(genes_do_35_edgeR))
# Create bar plots for upregulated and downregulated genes with color grading
barplot(up_counts_edgeR, names.arg = c("DIV_3", "DIV_7", "DIV_14", "DIV_22", "DIV_35"),
main = "Upregulated Genes (edgeR)", xlab = "Time Point", ylab = "Number of Genes", col = mycols_days_subset)barplot(do_counts_edgeR, names.arg = c("DIV_3", "DIV_7", "DIV_14", "DIV_22", "DIV_35"),
main = "Downregulated Genes (edgeR)", xlab = "Time Point", ylab = "Number of Genes", col = mycols_days_subset)
DESeq2 (Differential Expression analysis of Sequencing data 2) and edgeR
are both widely used, they differ in several aspects:
Modeling Variability: DESeq2 uses a negative binomial distribution model to account for variability in read counts. Variability between replicates is modeled by a dispersion parameter (α), which is estimated based on the observed properties of the data. It automatically controls the amount of shrinkage towards zero for log2 fold change estimates based on dispersion and count data characteristics. EdgeR also models variability using a negative binomial distribution, but it estimates dispersion using the Cox-Reid profile-adjusted likelihood method. It allows for separate dispersions for individual genes or trended dispersion depending on gene abundance.
Dispersion Estimation: DESeq2 estimates dispersion from the data itself, adjusting the amount of shrinkage based on data properties; however, edgeR requires a user-adjustable parameter, the prior degrees of freedom, which affects the contribution of individual gene estimates to the overall dispersion fit.
Hypothesis Testing: DESeq2 uses a Wald test for hypothesis testing, where the shrunken estimate of log2 fold change is divided by its standard error. This z-statistic is compared to a standard normal distribution. Multiple testing correction is performed using the Benjamini-Hochberg procedure. As for edgeR, the technique offers the Likelihood Ratio Test and Quasi-likelihood F-test. The Quasi-likelihood F-test is preferred, especially with small replicate numbers, as it provides more robust error rate control. It identifies differential expression regardless of the magnitude of differences.
As for the differences observed when carrying out the task, we can say that the results are the basically same, although the p-values are smaller, which indicates a larger confidence in the differential expression analysis with EdgeR.
#2) Volcano plot using edgeR results:
par(mfrow = c(2, 3))
VolcanoPlot(myDGE = res_1,
col_log2FC = as.numeric(res_1$logFC),
col_pval = as.numeric(res_1$FDR),
title = "Time Point 1")
VolcanoPlot(myDGE = res_2,
col_log2FC = as.numeric(res_2$logFC),
col_pval = as.numeric(res_2$FDR),
title = "Time Point 2")
VolcanoPlot(myDGE = res_3,
col_log2FC = as.numeric(res_3$logFC),
col_pval = as.numeric(res_3$FDR),
title = "Time Point 3")
VolcanoPlot(myDGE = res_4,
col_log2FC = as.numeric(res_4$logFC),
col_pval = as.numeric(res_4$FDR),
title = "Time Point 4")
VolcanoPlot(myDGE = res_5,
col_log2FC = as.numeric(res_5$logFC),
col_pval = as.numeric(res_5$FDR),
title = "Time Point 5")